SUMMARY

This paper presents a simple method for computing a shortest sequence of insertion and deletion 
commands that converts one given file to another. The method is particularly efficient when the 
difference between the two files is small compared to the files’ lengths. In experiments performed 
on typical files, the program often ran four times faster than the UNIX diff command. 

INTRODUCTION

A file comparison program produces a list of differences between two files. These 
differences can be couched in terms of lines, e.g. by telling which lines must be inserted, 
deleted or moved to convert the first file to the second. Alternatively, the list of 
differences can identify individual bytes. Byte-oriented comparisons are useful with 
non-text files, such as compiled programs, that are not divided into lines. 

The approach adopted here is to generate only instructions to insert or delete entire 
lines. Since lines are treated as indivisible objects, files can be treated as containing lines 
consisting of a single symbol. In other words, an n-line file is modelled by a string of n
symbols.

In more formal terms, the file comparison problem can be rephrased as follows. The
edit distance between two strings of symbols is the length of a shortest sequence of
insertions and deletions that will convert the first string to the second. The goal, then, is
to write a program that computes the edit distance between two arbitrary strings of 
symbols. In addition, the program must explicitly produce a shortest possible edit script 
(i.e. sequence of edit commands) for the given strings. 

Other approaches have been tried. For example, Tichy 1 discusses a file-comparison 
tool that determines how one file can be constructed from another by copying blocks of 
lines and appending lines. However, the ability to economically generate shortest-possible
edit scripts depends critically on the repertoire of instructions that are allowed
in the scripts. 2

File comparison algorithms have a number of potential uses beside merely producing 
a set of edit commands to be read by someone trying to understand the evolution of a 
program or document. For example, the edit scripts might be text editor instructions 
that are saved to avoid the expense of storing nearly identical files. Rather than storing 
two long files, just one of the files and a (presumably short) file containing instructions
like

    replace lines 6-8 by the line "*s++ = *t++;"

is stored. Rochkind 3 and Tichy 4 discuss version control systems based on this technique. 

As further testimony to the range of uses for file comparison techniques, one of the 
earliest algorithms for the problem was invented simultaneously by biologists interested 
in comparing long molecules such as proteins and by speech processing experts trying to 
compare spoken words with known words. 5 These algorithms have also been used in
information retrieval systems to cope with spelling mistakes 6 and for video redisplay,
where the problem is to send the computer terminal a minimal set of display
modification commands that will bring the image up to date. 7

STRAIGHTFORWARD APPROACHES 

The simplest method of file comparison is to look through the two files line by line until 
they disagree, then search forward in the files until a matching pair of lines is found. 
Regardless of the strategy for resynchronization, this simple approach suffers from the 
defect of sometimes producing edit scripts that are much longer than necessary. 

To see what goes awry, let the first string consist of n repetitions of the six symbols 
axxbxx and let the second string be derived by adding bxx to the front of the first string. 
Most of the simple resynchronization strategies will match xs. For the case n = 1, the
matching looks as follows:

Simple strategy

    axxbxx
     || ||
    bxxaxxbxx

Optimal strategy

       axxbxx
       ||||||
    bxxaxxbxx

For each of the n segments in the general case, a and b are deleted, then inserted in the 
opposite order; bxx is inserted at the end. This produces an edit script of length 4n + 3. 
On the other hand, the minimal edit script just inserts bxx at the front. 
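
The count 4n + 3 can be checked with a small sketch of one such simple strategy. The function name naive_script_length and its tie-breaking rule (resynchronize at the nearest matching pair, preferring to skip less of the first string) are assumptions made for illustration; the text does not commit to a particular resynchronization rule.

```c
#include <string.h>

/* Length of the edit script produced by a naive strategy: walk both
 * strings together and, on a mismatch, resynchronize at the nearest
 * matching pair of symbols (smallest total skip; ties are broken by
 * skipping less of a).  Hypothetical helper, for illustration only. */
int naive_script_length(const char *a, const char *b)
{
    int m = (int)strlen(a), n = (int)strlen(b);
    int i = 0, j = 0, len = 0;

    while (i < m && j < n) {
        int s, di, found = 0;

        if (a[i] == b[j]) {        /* symbols agree: keep walking */
            ++i;
            ++j;
            continue;
        }
        /* Search outward for the closest pair that matches. */
        for (s = 1; s <= (m - i) + (n - j) && !found; ++s)
            for (di = 0; di <= s && !found; ++di) {
                int dj = s - di;
                if (i + di < m && j + dj < n && a[i + di] == b[j + dj]) {
                    len += s;      /* delete di symbols, insert dj symbols */
                    i += di;
                    j += dj;
                    found = 1;
                }
            }
        if (!found)
            break;                 /* no later pair matches */
    }
    /* Delete what is left of a and insert what is left of b. */
    return len + (m - i) + (n - j);
}
```

For n = 1 the call naive_script_length("axxbxx", "bxxaxxbxx") returns 7, and for n = 2 the corresponding call returns 11, in agreement with 4n + 3, whereas a shortest script has length 3 in both cases.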

In practice, it is common for file comparison programs to require that several 
consecutive lines match before resynchronization. In the above example, the same edit 
script of length 4n + 3 is produced even if resynchronization requires that two 
contiguous symbols must match. But if three symbols must match before resynchronization
is achieved, then the optimal script is found. None the less, the example can be
modified to show that any resynchronization strategy that looks for k aligning symbols is 
not optimal. Moreover, for large k more time is required and matching substrings of 
length less than k will be missed. 

The failure of simple file comparison algorithms is more than just a theoretical 
curiosity; it can easily happen in practice. For example, suppose that a procedure is 
added to the beginning of a source file for a program and this file is then compared 
against the original file. Where is the first point that lines of the files match? Quite
possibly, the match occurs at the ends of the first procedures in each file where there are,
for example, several blank lines. If the file comparison program resynchronizes at this
point by, in effect, removing the first procedure from each file, then the resulting
situation is the same as at the start: file 2 is just file 1 with an additional procedure tacked
on the front. Thus the algorithm may report that the two files are entirely different, 
except for blank lines. 


AN EFFICIENT ALGORITHM 

The algorithm described below always produces a shortest possible edit script. It works 
very well on files where the differences are small, and poorly only when the files are quite 
different. These performance characteristics make for a very efficient comparison tool in 
the frequently occurring situations where differences are expected to be small. An earlier 
algorithm 8 enjoys the same general property, but is substantially more complicated. The 
detailed analysis verifying the correctness and time complexity of the algorithm is 
deferred until the next section. 

All the information needed to compute the edit distance between strings A and B can 
be determined by comparing every element of A with every element of B. Thus, 
denoting the length of A by m and the length of B by n, m × n comparisons suffice. A
systematic approach is developed that uses three rules to build up a solution from the 
solutions to the subproblems that are obtained by considering initial segments of the 
given strings. Which subproblems need to be solved cannot be determined until the final 
solution is in hand. This fact is confirmed by the straightforward algorithms, which fail 
because they fix on specific edit instructions before the two files are completely known. 

Let D[i,j] be the edit distance between the first i symbols of A (denoted A[1:i]) and the
first j symbols of B (denoted B[1:j]). D[i,j] makes sense even when i or j is zero; for
example D[i,0] is the edit distance between a string of i symbols and a string of 0
symbols, which obviously equals i. These values are arranged as a matrix with 1+m rows
(one row giving the values D[0,j] and one row for each entry of A) and 1+n columns
(one column giving the values D[i,0] and one for each entry of B). For example, if A =
abcabba and B = cbabac, the matrix of edit distances is

            c  b  a  b  a  c
         0  1  2  3  4  5  6
      a  1  2  3  2  3  4  5
      b  2  3  2  3  2  3  4
      c  3  2  3  4  3  4  3
      a  4  3  4  3  4  3  4
      b  5  4  3  4  3  4  5
      b  6  5  4  5  4  5  6
      a  7  6  5  4  5  4  5


In this example, the entry D[5,4], which lies at the intersection of row 5 and column
4 (row and column numbers start with 0), is the edit distance between abcab (the string
labelling rows 1-5) and cbab (the string labelling columns 1-4). The value in that
position is 3 because abcab can be transformed into cbab by deleting the leading a and
the c from the first string, then inserting a c at the front, but there is no shorter edit
script for the transformation. (There may be more than one shortest edit script for a
given pair of strings; in this case, another is to delete the leading a and b and insert a b
after the c.)
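
The whole matrix can be reproduced by brute force with m × n comparisons. The sketch below uses a conventional dynamic-programming recurrence, not the efficient algorithm developed in this paper; the names fill_matrix and MAXLEN are assumptions introduced for illustration.

```c
#include <string.h>

#define MAXLEN 64   /* hypothetical bound on string length for this sketch */

/* Fills D[i][j] with the insert/delete edit distance between the first
 * i symbols of a and the first j symbols of b, and returns D[m][n].
 * Both strings must be at most MAXLEN symbols long. */
int fill_matrix(const char *a, const char *b, int D[MAXLEN + 1][MAXLEN + 1])
{
    int m = (int)strlen(a), n = (int)strlen(b), i, j;

    for (i = 0; i <= m; ++i)
        D[i][0] = i;                        /* delete i symbols */
    for (j = 0; j <= n; ++j)
        D[0][j] = j;                        /* insert j symbols */
    for (i = 1; i <= m; ++i)
        for (j = 1; j <= n; ++j) {
            int best = D[i - 1][j] + 1;     /* delete symbol i of a */
            if (D[i][j - 1] + 1 < best)
                best = D[i][j - 1] + 1;     /* insert symbol j of b */
            if (a[i - 1] == b[j - 1] && D[i - 1][j - 1] < best)
                best = D[i - 1][j - 1];     /* symbols match: no edit */
            D[i][j] = best;
        }
    return D[m][n];
}
```

Called with "abcabba" and "cbabac", the function returns 5 and leaves D[5][4] equal to 3, in agreement with the matrix above.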

The above matrix exhibits several useful patterns that appear for any choice of the two 
input strings. The ith row (for any i) begins with i, and each pair of adjacent values
differs by 1. The same is true of columns. Diagonals, however, have a different 
structure. To make this precise, number the diagonals of D as follows. 


    [Figure: the matrix D of the example with its diagonals drawn in and numbered.
    Diagonal 0 starts at D[0,0], diagonals 1 and 2 start at D[0,1] and D[0,2], and
    diagonals -1 and -2 start at D[1,0] and D[2,0]; diagonal k holds the entries
    D[i,j] with j - i = k.]

The values of D along diagonal k begin with |k|, then jump to |k| + 2, then to |k| + 4,
and so on. The validity of this pattern follows from another of particular algorithmic
interest: all occurrences of a given value d are on diagonals -d, -d+2, . . ., d-2 and d.
That is, the entries with value d lie along alternate diagonals in a band of half-width d,
centered around diagonal 0. A formal proof of these observations is not given here but
can be inferred from the algorithm's proof of correctness given below.
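
The diagonal pattern is easy to test empirically. The following sketch computes D by brute force and checks that every entry with value d lies on a diagonal k = j - i with |k| at most d and d - |k| even. The name diagonal_pattern_holds and the bound MAXLEN are assumptions for illustration, and a passing check on sample inputs is of course not a proof.

```c
#include <stdlib.h>
#include <string.h>

#define MAXLEN 64   /* hypothetical bound on string length for this sketch */

/* Returns 1 if every entry D[i][j] lies on a diagonal k = j - i with
 * |k| <= D[i][j] and D[i][j] - |k| even; returns 0 otherwise.
 * Both strings must be at most MAXLEN symbols long. */
int diagonal_pattern_holds(const char *a, const char *b)
{
    static int D[MAXLEN + 1][MAXLEN + 1];
    int m = (int)strlen(a), n = (int)strlen(b), i, j;

    /* Brute-force computation of all edit distances. */
    for (i = 0; i <= m; ++i)
        D[i][0] = i;
    for (j = 0; j <= n; ++j)
        D[0][j] = j;
    for (i = 1; i <= m; ++i)
        for (j = 1; j <= n; ++j) {
            int best = D[i - 1][j] + 1;
            if (D[i][j - 1] + 1 < best)
                best = D[i][j - 1] + 1;
            if (a[i - 1] == b[j - 1] && D[i - 1][j - 1] < best)
                best = D[i - 1][j - 1];
            D[i][j] = best;
        }

    /* Check the band and the alternation of diagonals. */
    for (i = 0; i <= m; ++i)
        for (j = 0; j <= n; ++j) {
            int k = abs(j - i);
            if (k > D[i][j] || (D[i][j] - k) % 2 != 0)
                return 0;
        }
    return 1;
}
```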

The file comparison algorithm systematically constructs a solution by using three
rules to fill values in the D matrix. Along with each value D[i,j], the algorithm
accumulates an edit script of length D[i,j] that converts A[1:i] to B[1:j]. The algorithm
begins operation by determining all entries in the D matrix that are 0. As is evident,
these entries are just the values D[i,i] on diagonal 0 where A[k] = B[k] for all k ≤ i. In
other words, the algorithm starts by finding identical prefixes of A and B. Then the
algorithm applies the three rules to determine all entries in D that equal 1. Then it fills in
the 2s, then the 3s and so on. This continues until the ‘south-east’ value D[m,n] is
determined, at which point the algorithm has found a shortest possible edit script for
converting the first input string to the second. In specifying the rules, the notation A[i]
denotes the ith symbol of A and B[j] denotes the jth symbol of B.


Rule 1: ‘Move right’

Suppose that:

(i) D[i,j-1] (the value just to the left of D[i,j]) is known.

(ii) An edit script of length D[i,j-1] that converts A[1:i] to B[1:j-1] is known.

(iii) D[i,j] is unknown.

Then D[i,j] = D[i,j-1] + 1, and adding the command ‘Insert B[j] after symbol i’
to the edit script of (ii) produces a shortest edit script for converting A[1:i] to
B[1:j].

If the algorithm has determined the value of D[i,j-1], but not that of D[i,j], then
D[i,j] must be greater than D[i,j-1]. Consequently, D[i,j] must equal D[i,j-1] + 1,
since the rule shows how to construct a script of this length.

As an example, consider D[3,3], the edit distance from abc to cba in the sample
problem. Suppose that (i) D[3,2], the edit distance from abc to cb, is 3, that (ii)

    Delete symbol 1
    Delete symbol 2
    Insert b after symbol 3

is a shortest possible edit script for converting abc to cb, and that (iii) D[3,3] has not
been filled in. Rule 1 then asserts that D[3,3] must equal 4, and appending the command
‘Insert a after symbol 3’ to the edit script given in (ii) yields a shortest edit script for
converting abc to cba.

Edit commands refer to symbol positions in the original string. Thus the second
delete command in the script above removes the b in abc and not the c. Also, a sequence
of insertions after the same position are assumed to occur in order. Thus the first insert
command in the script above places a b after the c in abc and the second ‘Insert after
symbol 3’ places an a after the b just inserted. These ‘parallel’ scripts are not sequentially
executable by common text editors but are oriented towards users, who cannot retain the
intermediate states of a long edit. None the less, these scripts perform the desired
transformation if executed sequentially in reverse order.
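
The ‘parallel’ reading of a script can be made concrete with a short sketch that applies every command against positions in the original string. The type struct cmd and the function apply_script are hypothetical names introduced here; they are not the representation used by the program in the Appendix.

```c
#include <string.h>

enum { INSERT = 1, DELETE = 2 };

/* One edit command.  pos is a symbol position in the ORIGINAL string
 * (1-based; 0 means "before the first symbol").  sym is the symbol to
 * insert and is ignored by DELETE.  Hypothetical type for this sketch. */
struct cmd {
    int  op;
    int  pos;
    char sym;
};

/* Applies the commands to a and writes the result to out, which must
 * have room for strlen(a) + ncmds + 1 characters. */
void apply_script(const char *a, const struct cmd *script, int ncmds,
                  char *out)
{
    int m = (int)strlen(a), p, c, len = 0;

    for (p = 0; p <= m; ++p) {
        if (p >= 1) {
            /* Keep symbol p unless some command deletes it. */
            int deleted = 0;
            for (c = 0; c < ncmds; ++c)
                if (script[c].op == DELETE && script[c].pos == p)
                    deleted = 1;
            if (!deleted)
                out[len++] = a[p - 1];
        }
        /* Insertions after position p occur in script order. */
        for (c = 0; c < ncmds; ++c)
            if (script[c].op == INSERT && script[c].pos == p)
                out[len++] = script[c].sym;
    }
    out[len] = '\0';
}
```

Applied to abc with the four commands ‘Delete symbol 1’, ‘Delete symbol 2’, ‘Insert b after symbol 3’ and ‘Insert a after symbol 3’, the sketch produces cba.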


Rule 2: ‘Move down’

Suppose that:

(i) D[i-1,j] (the value just above D[i,j]) is known.

(ii) An edit script of length D[i-1,j] that converts A[1:i-1] to B[1:j] is
known.

(iii) D[i,j] is unknown.

Then D[i,j] = D[i-1,j] + 1, and adding the command ‘Delete symbol i’ to the edit
script of (ii) produces a shortest edit script for converting A[1:i] to B[1:j].

As an example, again consider D[3,3] in the sample problem, which is the edit
distance from abc to cba. Suppose that (i) D[2,3], the edit distance from ab to cba, is 3,
that (ii)

    Delete symbol 1
    Insert c after symbol 1
    Insert a after symbol 2

is a shortest possible edit script for converting ab to cba, and that (iii) D[3,3] has not
been filled in. Rule 2 then asserts that D[3,3] = 4, and appending the command ‘Delete
symbol 3’ to the edit script given in (ii) yields a shortest edit script for converting abc to
cba.


Rule 3: ‘Slide down the diagonal’

Suppose that:

(i) D[i-1,j-1] (the value just above and to the left of D[i,j]) is known.

(ii) An edit script of length D[i-1,j-1] that converts A[1:i-1] to B[1:j-1] is
known.

(iii) A[i] = B[j].

Then D[i,j] = D[i-1,j-1], and the edit script of (ii) is a shortest edit script for
converting A[1:i] to B[1:j].

As an example, look at D[5,4] in the sample problem. Rule 3 says that any edit script
of length D[4,3] for abca and cba, such as

    Delete symbol 1
    Delete symbol 2
    Insert b after symbol 3

is an edit script for abcab and cbab of length D[5,4].

For an illustration of how values of D are determined, suppose that the 0s, 1s and 2s
have been filled in the sample matrix of edit distances:

            c  b  a  b  a  c
         0  1  2
      a  1  2     2
      b  2     2     2
      c     2
      a
      b
      b
      a

Now apply Rules 1-3 to see how far 3s extend down diagonal 1. Rules 1 and 2 imply that
any unfilled position that is immediately to the right of, or immediately below, a 2 must
contain a 3. Thus these rules place 3s in positions [1,2], [2,3] and [3,4]. Rule 3 then
permits a 3 to fill in the [4,5] location, since the row and column labels there are both a,
and since this position has a 3 immediately to its north-west. The same sort of reasoning
fills in the 3s shown below in diagonals -3, -1 and 3:

            c  b  a  b  a  c
         0  1  2  3
      a  1  2  3  2  3
      b  2  3  2  3  2  3
      c  3  2  3     3     3
      a     3     3     3
      b        3     3
      b
      a

The example illustrates that the algorithm finds the d entries of the D matrix by moving
right or down from a d-1 entry and then making a sequence of diagonal moves. It follows
that d values are filled in only on diagonals -d, -d+2, . . ., d-2, d. As already mentioned,
all 0s go on diagonal 0. The 1s are filled in by either moving down from diagonal 0 to
diagonal -1, or moving right from diagonal 0 to diagonal 1, then sliding down a diagonal.
For the 2s, the algorithm moves right or down from diagonals ±1, from which it can reach
only diagonals -2, 0 and 2, then slides diagonally. Continuing inductively shows the
general pattern to be true.

For efficiency, not all the entries of D are explicitly computed and stored. The
algorithm only records those d entries nearest the bottom of D in each relevant diagonal
(i.e. diagonals -d, -d+2, . . ., d-2, d). The notation last_d[k] indicates the last row
containing the most recent value of d to be filled in diagonal k. This simplification is
valid since the sought-after corner D[m,n] is such an entry. Furthermore, since the
diagonals containing (d-1)s alternate with diagonals containing ds, the algorithm can
use the same array to hold both the (d-1) entries and the d entries it is computing from
them.

Assuming that last_d[k-1] and last_d[k+1] indicate the positions of the last (d-1)s
on diagonals k-1 and k+1, how is the last d on diagonal k located? The first problem is
to find a value of d on diagonal k. One possibility is to move right from the last d-1 in
diagonal k-1 to the row last_d[k-1] on diagonal k. The other possibility is to move
down from the last d-1 in diagonal k+1 to the row last_d[k+1] + 1 on diagonal k. The
algorithm selects the more advantageous of the two moves. Thus if last_d[k+1] ≥
last_d[k-1] it moves down, otherwise it moves right. (Special care is needed for k =
±d, i.e. for the diagonals that have (d-1)s on only one side.) Once on diagonal k, the
algorithm slides as far as is permitted by Rule 3.

For an example, return to the problem of filling 3s in diagonal 1 of the sample
problem. Just after the 2s have been filled in, last_d[0] = 2 and last_d[2] = 2. Moving
right from diagonal 0 would yield a 3 in row 2 of diagonal 1, whereas moving down from
diagonal 2 yields a 3 in row 3. Hence the algorithm applies the move down rule to reach
position [3,4], slides down to row 4 using Rule 3, and sets last_d[1] to 4.

Given this method of determining edit distances, producing a corresponding edit
script is straightforward. An edit script denoted script[k] is associated with each
last_d[k], and updated according to Rules 1-3. The algorithm fragment that locates the
last d on diagonal k is given in Figure 1.

    /* Find a d on diagonal k. */
    if (k = -d or (k ≠ d and last_d[k+1] ≥ last_d[k-1])) {
        /* Moving down from the last d-1 on diagonal k+1   */
        /* puts you further along diagonal k than does     */
        /* moving right from the last d-1 on diagonal k-1. */
        row = last_d[k+1] + 1
        script[k] = script[k+1] with the command
                    ‘Delete the rowth symbol’ appended
    } else {
        /* Move right from the last d-1 on diagonal k-1. */
        row = last_d[k-1]
        script[k] = script[k-1] with the command
                    ‘Insert B[row+k] after the rowth symbol’ appended
    }

    /* Column where row intersects diagonal k */
    col = row + k

    /* Slide down the diagonal. */
    while (row < m and col < n and A[row+1] = B[col+1]) {
        row = row + 1
        col = col + 1
    }
    last_d[k] = row

    Figure 1. Locating the last d on diagonal k

The test

    if (k = -d or (k ≠ d and last_d[k+1] ≥ last_d[k-1]))

takes special care with the cases k = ±d. When k is -d, the algorithm is working on the
lowest diagonal that contains ds. The diagonal above it (diagonal k+1) contains one or
more (d-1)s, but the diagonal just below (diagonal k-1) contains none, and
last_d[k-1] is not defined. The clause ‘k = -d’ guarantees that the move down rule will be
applied. Similarly, the clause ‘k ≠ d’ guarantees that the move right rule is applied when
k = d.

The next step is to enclose the above algorithm fragment in loops that vary d and k
appropriately. Perhaps the first pair of looping statements that one might try is:

    for d = 1, 2, 3, . . .
        for k = -d, -d+2, . . ., d-2, d

However, an alternative approach (Figure 2) helps to guarantee that array references
stay in bounds and further restricts the ‘search band’ of the algorithm. For example,
suppose row reaches m, meaning that the algorithm has hit the bottom of the D matrix,
say on diagonal k. It is pointless to fill in values to the left of diagonal k, since they
cannot contribute to the value of D[m,n]. The algorithm arranges that diagonal k+1 will
be the lowest diagonal that is considered when d is incremented. This is done by using a
pair of bounds, lower and upper, that give the range of diagonals to consider at each
iteration. Normally, lower is decremented by one and upper is incremented by one at
each iteration. However, in the situation above the algorithm sets lower to k+2 and lets
the default decrement set it correctly to k+1. The bound upper is similarly set when col
reaches n on a given diagonal.

    /* Initialize: 0 entries in D indicate identical prefixes */
    row = min(i: A[i+1] ≠ B[i+1])
    last_d[0] = row
    script[0] = NULL
    if (row = m) lower = 1 else lower = -1
    if (row = n) upper = -1 else upper = 1
    if (lower > upper)
        Report that the files are identical and terminate execution

    /* for each value of the edit distance */
    for d = 1, 2, . . ., max_d {
        /* for each relevant diagonal */
        for k = lower, lower+2, . . ., upper-2, upper {
            Locate the last d on diagonal k, as in Figure 1.
            if (row = m and col = n)
                Print the edit script pointed to by script[k]
                and terminate execution.
            if (row = m)
                /* Hit last row; don't look to the left. */
                lower = k + 2
            if (col = n)
                /* Hit last column; don't look to the right. */
                upper = k - 2
        }
        lower = lower - 1
        upper = upper + 1
    }
    Print a message indicating that the edit distance is greater than max_d.

    Figure 2. The high-level structure of the algorithm

The algorithm terminates after max_d iterations of the outermost loop. If there are no
edit scripts of length at most max_d, then the algorithm quits and reports this fact. This
characteristic is desirable in the common case that one does not want to see the
difference if it is too large to be useful. If max_d is set to m + n, the algorithm finds the
difference regardless of its size.
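
The two figures can be combined into a compact sketch that computes only the edit distance, without recording scripts. It works on strings of characters rather than on lines of files; the function name edit_distance, the dynamically sized last_d array and the convention of returning -1 when the distance exceeds max_d are assumptions made for this illustration and are not part of the program given in the Appendix.

```c
#include <stdlib.h>
#include <string.h>

/* Returns the insert/delete edit distance between a and b, or -1 if it
 * exceeds max_d or if memory cannot be obtained. */
int edit_distance(const char *a, const char *b, int max_d)
{
    int m = (int)strlen(a), n = (int)strlen(b);
    int origin = m + n + 1;               /* subscript for diagonal 0 */
    int *last_d = malloc((size_t)(2 * origin + 1) * sizeof *last_d);
    int lower, upper, d, k, row, col, result = -1;

    if (last_d == NULL)
        return -1;

    /* Initialize: 0 entries in D indicate identical prefixes. */
    for (row = 0; row < m && row < n && a[row] == b[row]; ++row)
        ;
    last_d[origin] = row;
    lower = (row == m) ? origin + 1 : origin - 1;
    upper = (row == n) ? origin - 1 : origin + 1;
    if (lower > upper) {                  /* the strings are identical */
        free(last_d);
        return 0;
    }

    /* for each value of the edit distance */
    for (d = 1; d <= max_d && result < 0; ++d) {
        /* for each relevant diagonal */
        for (k = lower; k <= upper; k += 2) {
            /* Find a d on diagonal k. */
            if (k == origin - d ||
                (k != origin + d && last_d[k + 1] >= last_d[k - 1]))
                row = last_d[k + 1] + 1;  /* move down */
            else
                row = last_d[k - 1];      /* move right */
            col = row + k - origin;

            /* Slide down the diagonal. */
            while (row < m && col < n && a[row] == b[col]) {
                ++row;
                ++col;
            }
            last_d[k] = row;

            if (row == m && col == n) {   /* hit the south-east corner */
                result = d;
                break;
            }
            if (row == m)
                lower = k + 2;            /* don't look to the left */
            if (col == n)
                upper = k - 2;            /* don't look to the right */
        }
        --lower;
        ++upper;
    }
    free(last_d);
    return result;
}
```

For the sample problem, edit_distance("abcabba", "cbabac", 13) returns 5, the south-east entry of the matrix of edit distances.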

An implementation of the complete algorithm in C is given in the Appendix. 

ANALYSIS OF THE ALGORITHM 


Correctness 

The algorithm employs the simple strategy of finding each d entry by moving right or
down from a d-1 entry and then making a sequence of diagonal moves. It has yet to be
shown that this strategy correctly fills in the D matrix and hence that the algorithm is
correct. An inductive argument is needed to show that at stage d every d entry in the D
matrix is found by the strategy. In effect, an argument about all edit scripts is required.
Certainly, scripts that insert and subsequently delete a given symbol are not optimal and
need not be considered. For the remaining scripts, assume without loss of generality that
the commands of a script are ordered according to the position in A affected by the
command. The scripts produced by Rules 1-3 have this property.

Suppose that S is a shortest edit script of length d for entry D[i,j]. For the basis, d = 0,
it follows that i = j and A[k] = B[k] for all k ≤ i. But the algorithm correctly fills in these
entries in its initialization step. Proceeding inductively, suppose that d > 0 and let R be
the first d-1 commands of S. The algorithm's strategy is correct if R is a shortest script
for an entry D[p,q] that can reach D[i,j] via a move right or down and a sequence of
diagonal moves. Suppose that the last command of S is ‘Delete symbol k’, where k is
between 1 and i. Then the last i-k symbols of A[1:i] must match the last i-k symbols of
B[1:j]. Thus S must also be a shortest script for converting A[1:k] into B[1:j-(i-k)]. If
it were not, then the shorter script would, by Rule 3, be a shorter script for D[i,j], which
is a contradiction. It then follows that R must be a shortest script for converting
A[1:k-1] into B[1:j-(i-k)]. Again, if it were not then the shorter script plus the
command ‘Delete symbol k’ would, by Rules 2 and 3, be a shorter script for D[i,j],
which is a contradiction. But entry D[i,j] can be reached from D[k-1,j-(i-k)] by a
move down and a sequence of diagonal moves. Similar reasoning reveals that if the last
command of S is ‘Insert an x after symbol k’ where k is between 1 and i and x is
arbitrary, then R is a shortest script for converting A[1:k] to B[1:j-(i-k)-1]. But
again, entry D[i,j] can be reached from D[k,j-(i-k)-1] by a move right and a sequence
of diagonal moves. Thus the strategy and algorithm are correct.

A more detailed discussion of the formal issues for the edit distance problem and its 
generalizations is given in Reference 9. 

Efficiency

At worst, the algorithm must determine all the entries of D, hence its running time is
at worst proportional to m × n. This is expected, since it is known that for every one of a
certain large class of file comparison algorithms there exists some pair of files for which
the algorithm takes time proportional to m × n. 10

The advantage of the algorithm is that it performs efficiently when the size of the
output is small compared to m and n. Since the algorithm stops as soon as D[m,n] is
filled in, only diagonals -d to d are considered, where d = D[m,n]. Thus the running
time is proportional to the number of entries on those diagonals, which is less than
(2d+1) min(m,n). Moreover, the expected running time is proportional to min(m,n) +
d^2 under appropriate distributional assumptions. 11
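
As a rough illustration with hypothetical sizes (not measurements from the experiments reported here), take m = n = 1000 and d = 10. The bound on the number of entries examined then compares with the size of the full matrix as follows:

```latex
(2d+1)\,\min(m,n) = 21 \times 1000 = 21\,000
\qquad\text{versus}\qquad
m \times n = 1000 \times 1000 = 10^{6}
```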

Most file comparison algorithms do not perform this well when the two files are nearly
identical. Some take time proportional to m × n regardless of d. 12 Other algorithms
depend heavily on the number r of pairs [i,j] where A[i] equals B[j], which may be large
in cases where d is small. 13 13 14

An algorithm whose performance depends on r is used in the UNIX diff command.

To show how disastrous this dependence on r can be, diff was compared to fcomp, the
new algorithm's implementation as listed in the Appendix. Both programs were run with
file 1 consisting of 1000 blank lines, and file 2 consisting of file 1 with a single non-blank
line added to both ends. This choice makes d = 2 and r = 10^6. The fcomp program took
about half a second of computer time (on a VAX 11/780), whereas diff took over 2.5
minutes.
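
The value of r for this pair of files can be confirmed by direct counting. The sketch below is a brute-force count written for illustration; the function name matching_pairs is an assumption, and neither diff nor fcomp computes r in this way.

```c
#include <string.h>

/* Number of pairs [i,j] for which line i of A equals line j of B.
 * A holds m lines and B holds n lines. */
long matching_pairs(const char *const *A, int m,
                    const char *const *B, int n)
{
    long r = 0;
    int i, j;

    for (i = 0; i < m; ++i)
        for (j = 0; j < n; ++j)
            if (strcmp(A[i], B[j]) == 0)
                ++r;
    return r;
}
```

With 1000 blank lines in file 1 and the same lines plus one non-blank line at each end in file 2, every line of file 1 matches each of the 1000 blank lines of file 2, so the count is 1000 × 1000.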

To turn the tables, the two programs were run on a pair of 1000-line files with no lines in
common. This choice makes d = 2000 and r = 0; fcomp ran out of memory after one
minute (and reported that the edit distance was at least 617), whereas diff solved the
problem in about 8 seconds. (This mode of failure for fcomp can be avoided since the
algorithm can be modified to run in space proportional to m + n. 11) Thus there exist
pathological cases where fcomp's performance is much worse than diff's, and vice versa.

To obtain a more realistic idea of how fcomp compares with diff, figures were gathered
for their relative performance comparing 1000-line files of C programs, with various
values of d between 5 and 50. Typical values of r were 10,000-20,000. (If one out of
every 10 lines were blank, then the pairs of blank lines, one from each file, would
contribute 10,000 to r.) On these problems, fcomp typically ran about 4 times faster than
diff. The only circumstances in which this trend was broken were cases when all
differences between the files fell in a small range of lines. diff begins operation by
stripping away lines that match at the fronts and rears of the two files. For example, if all
differences between the two files occur in lines 1-100, diff quickly reduces the problem
to that of comparing two 100-line files, then applies its main algorithm. For such
problems, fcomp ran about 2-3 times faster than diff.

It is notoriously difficult to judge the relative merits of two programs, since 
performance often depends critically on the data and the programming details. To 
evaluate the effect of coding differences, diff's underlying algorithm was implemented in
the spirit of fcomp. In particular, each line of the two files was saved using a call to the 
storage allocation routine malloc. (To conserve space, diff stores internally only the lines’ 
hash values. This strategy costs the time to compute those values and to read each file 
twice, but speeds the process of comparing lines.) Again, fcomp was more efficient by a 
factor of around four. The program listing given in the Appendix will facilitate efforts to 
corroborate, extend or invalidate this experimental conclusion. 

APPENDIX

The following program implements the file comparison algorithm in C.
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r 

• fcomp - a file comparison program 


/• Initialize: 0 entries in D indicate identical prefixes. 7 

for (row = 0; row < m && row < n && strcmp(A[row], B[row]) = 0; ++row) 


* A command line has the form 

* fcomp [-n] filel file2 

* where the optional flag n is an integer constant that limits the size of 

* edit scripts that will be considered by fcomp. If all edit scripts changing 

* filel to file2 contain more than n insertions and deletions, then a message 

* to that effect is all that is printed. If no n is specified, then arbitrarily 

* long edit scripts are considered. 

7 


last_d[ORIGIN] = row; 
script[ORIGIN] = NULL; 

lower = (row = m) ? ORIGIN + 1 : ORIGIN - 1; 
upper = (row = n) ? ORIGIN - 1 : ORIGIN + 1; 
if (lower > upper) { 

puts("The files are identical."); 
exit(O): 

1 


#include <stdio.h> 

#define MAXLINES 2000 
#define ORIGIN MAXLINES 
#define INSERT 1 
#define DELETE 2 


/* maximum number of lines in a file 7 
/* subscript for diagonal 0 7 


/* edit scripts are stored in linked lists */ 


struct edit { 

struct edit ‘link; 
int op; 
int linel; 
int Iine2; 


}; 

char *A[MAXLINES], *B[MAXLINES]; 


/* previous edit command */ 

/* INSERT or DELETE 7 
/* line number in filel 7 
/• line number in file2 7 

/* pointers to lines of filel and file2 */ 


main(argc. argv) 



int argc; 



char ‘argvQ; 



( 

int 

max_d, 

/■ 


m, 

/■ 


n, 

/■ 


lower. 

r 


upper, 

r 


d, 

r 


k, 

r 


row, 

r 


col; 

r 


bound on size of edit script 7 

number of lines in filel */ 

number of lines in file2 7 

left-most diagonal under consideration */ 

right-most diagonal under consideration 7 

current edit distance */ 

current diagonal 7 

row number */ 

column number 7 


/* for each diagonal, two items are saved: 7 

int last_d[2*MAXLINES+1]; /* the row containing the last d 7 

struct edit *script[2*MAXLINES+1]; /* corresponding edit script 7 


struct edit ‘new; 
char *malloc(); 

if (argc > 1 && argv[1][0] = '-') { 
max_d = atoi(&argv[1][1]); 

++argv; 

—argc; 

) else 

max_d = 2*MAXLINES; 
if (argc 1= 3) 

fatal("Fcomp requires two file names."); 
/* Read in filel and file2. 7 
m = in_file(argv[1]. A); 
n = in_file(argv[2], B); 



/* for each value of the edit distance 7 
for (d = 1; d <= max_d ; ++d) { 

/* for each relevant diagonal 7 
for (k = lower; k <= upper; k += 2) { 

/* Get space for the next edit instruction. */ 
new = (struct edit ') malloc(sizeof(struct edit)); 
if (new = NULL) 
exceed (d); 

/* Find a d on diagonal k. 7 

if (k = ORIGIN-d || k != ORIGIN+d && last_d[k+1] >= last_d[k-1]) { 

r 

‘ Moving down from the last d-1 on diagonal k+1 
‘ puts you farther along diagonal k than does 
‘ moving right from the last d-1 on diagonal k-1. 

7 

row = last_d[k+1]+1; 
neW->link = script[k+1]; 
new->op = DELETE; 

} else { 

/* Move right from the last d-1 on diagonal k-1. 7 
row = last_d[k— 1]; 
new->link = script[k-1]; 
new->op = INSERT; 

} 

/• Code common to the two cases. */ 
new->line1 = row; 

new->line2 = col = row + k - ORIGIN; 
script[k] = new; , 

/* Slide down the diagonal. */ 

while (row < m && col < n && strcmp(A[row], B[col]) = 0) { 

++row; 

++col; 

} 

last_d[k] = row; 

            if (row == m && col == n) {
                /* Hit southeast corner; have the answer. */
                put_scr(script[k]);
                exit(0);
            }
            if (row == m)
                /* Hit last row; don't look to the left. */
                lower = k+2;
            if (col == n)
                /* Hit last column; don't look to the right. */
                upper = k-2;
        }
        --lower;
        ++upper;
    }
    exceed(d);
}

/* in_file - read in a file and return a count of the lines */
in_file(filename, P)
char *filename, *P[];
{
    char buf[100], *malloc(), *fgets(), *save, *b;
    FILE *fp, *fopen();
    int lines = 0;

    if ((fp = fopen(filename, "r")) == NULL) {
        fprintf(stderr, "Cannot open file %s.\n", filename);
        exit(1);
    }
    while (fgets(buf, 100, fp) != NULL) {
        if (lines >= MAXLINES)
            fatal("File is too large for fcomp.");
        if ((save = malloc(strlen(buf)+1)) == NULL)
            fatal("Not enough room to save the files.");
        P[lines++] = save;
        for (b = buf; *save++ = *b++; )    /* copy the line */
            ;
    }
    fclose(fp);
    return(lines);
}

/* put_scr - print the edit script */
put_scr(start)
struct edit *start;
{
    struct edit *ep, *behind, *ahead, *a, *b;
    int change;

    /* Reverse the pointers. */
    ahead = start;
    ep = NULL;
    while (ahead != NULL) {
        behind = ep;
        ep = ahead;
        ahead = ahead->link;
        ep->link = behind;    /* Flip the pointer. */
    }

    /* Print commands. */
    while (ep != NULL) {
        b = ep;
        if (ep->op == INSERT)
            printf("Inserted after line %d:\n", ep->line1);
        else {
            /* Look for a block of consecutive deleted lines. */
            do {
                a = b;
                b = b->link;
            } while (b != NULL && b->op == DELETE && b->line1 == a->line1+1);
            /* Now b points to the command after the last deletion. */
            change = (b != NULL && b->op == INSERT && b->line1 == a->line1);
            if (change)
                printf("Changed ");
            else
                printf("Deleted ");
            if (a == ep)
                printf("line %d:\n", ep->line1);
            else
                printf("lines %d-%d:\n", ep->line1, a->line1);
            /* Print the deleted lines. */
            do {
                printf(" %s", A[ep->line1-1]);
                ep = ep->link;
            } while (ep != b);
            if (!change)
                continue;
            printf("To:\n");
        }
        /* Print the inserted lines. */
        do {
            printf(" %s", B[ep->line2-1]);
            ep = ep->link;
        } while (ep != NULL && ep->op == INSERT && ep->line1 == b->line1);
    }
}
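
Each edit command created in main is linked to its predecessor (new->link is set to script[k+1] or script[k-1]), so the chain handed to put_scr runs from the last command back to the first. That is why put_scr begins by reversing the pointers. The reversal can be sketched in isolation as follows. This is a minimal illustration, not part of fcomp: struct node and reverse are hypothetical names standing in for struct edit and the reversal loop of put_scr.

```c
#include <stddef.h>

/* Hypothetical stand-in for the listing's struct edit. */
struct node {
    int value;
    struct node *link;     /* points to the previously created node */
};

/*
 * Reverse a singly linked list in place and return its new head.
 * The three pointers play the same roles as in put_scr:
 * ahead is the unprocessed remainder, ep is the node being flipped,
 * and behind is the already reversed part.
 */
struct node *reverse(struct node *start)
{
    struct node *ahead = start;
    struct node *ep = NULL;
    struct node *behind;

    while (ahead != NULL) {
        behind = ep;
        ep = ahead;
        ahead = ahead->link;
        ep->link = behind;  /* flip the pointer */
    }
    return ep;
}
```

After the call, following link from the returned node visits the commands in the order in which they were created, which is the order in which they must be printed.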

/* fatal - print error message and die */
fatal(msg)
char *msg;
{
    fprintf(stderr, "%s\n", msg);
    exit(1);
}

/* exceed - the difference exceeds d */
exceed(d)
int d;
{
    fprintf(stderr, "The files differ in at least %d lines.\n", d);
    exit(1);
}
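
The diagonal search performed in main can be tried out on ordinary character strings, with each character playing the role of one line. The following is a sketch under stated assumptions rather than a replacement for fcomp: edit_distance is a hypothetical name, only the distance is returned (no edit script is built), and the array of furthest rows is allocated to fit the two strings instead of being sized by MAXLINES.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Return the least number of single-character insertions and deletions
 * that convert string a to string b, or -1 if memory cannot be obtained.
 * last_d[k] holds the furthest row reached so far on diagonal k, where
 * diagonal k contains the cells with col - row == k - origin.
 */
int edit_distance(const char *a, const char *b)
{
    int m = (int) strlen(a);
    int n = (int) strlen(b);
    int origin = m + 1;            /* index of the main diagonal (assumed layout) */
    int d, k, row, col, lower, upper;
    int *last_d;

    /* Distance 0: slide down the main diagonal over the common prefix. */
    for (row = 0; row < m && row < n && a[row] == b[row]; ++row)
        ;
    if (row == m && row == n)
        return 0;

    last_d = calloc((size_t) (m + n + 3), sizeof *last_d);
    if (last_d == NULL)
        return -1;
    last_d[origin] = row;
    lower = (row == m) ? origin + 1 : origin - 1;
    upper = (row == n) ? origin - 1 : origin + 1;

    for (d = 1; d <= m + n; ++d) {
        for (k = lower; k <= upper; k += 2) {
            if (k == origin - d ||
                (k != origin + d && last_d[k + 1] >= last_d[k - 1]))
                row = last_d[k + 1] + 1;   /* move down: a deletion */
            else
                row = last_d[k - 1];       /* move right: an insertion */
            col = row + k - origin;

            /* Slide down the diagonal while the symbols match. */
            while (row < m && col < n && a[row] == b[col]) {
                ++row;
                ++col;
            }
            last_d[k] = row;

            if (row == m && col == n) {    /* reached the southeast corner */
                free(last_d);
                return d;
            }
            if (row == m)
                lower = k + 2;             /* last row: stop looking left */
            if (col == n)
                upper = k - 2;             /* last column: stop looking right */
        }
        --lower;
        ++upper;
    }
    free(last_d);
    return -1;                             /* not reached: d = m + n always suffices */
}
```

For example, edit_distance("abc", "abd") returns 2, corresponding to one deletion and one insertion.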
PRTDS - A Pascal Run-Time Diagnostics
System 

N. H. WHITE AND K. H. BENNETT 

Department of Computer Science, University of Keele, Keele, Staffordshire, ST5 5BG, U.K.


SUMMARY 

A run-time diagnostics system has been described 1 that allows the user to interrogate a program 
fully in source language terms. In particular, dynamically created data structures are catered for. 
The diagnostics system allows all objects including those created dynamically to be interrogated 
and can display a graphical representation of linked data structures. This paper describes the 
implementation of such a system with the language Pascal on a GEC 4080. 

key words Pascal Data structures Diagnostics 


INTRODUCTION 

The diagnostics program described is implemented on the GEC 4080s at Keele 
University. The GEC 4080 2 contains nine registers of which four can be considered as 
general purpose. A program has direct access to sixty-four kilobytes of store arranged as 
four segments of up to sixteen kilobytes each. By overlaying nominated areas of store 
into these four segments, a program may access up to four Megabytes of memory. 
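
The figures above fit together as follows: four segments of sixteen kilobytes give the sixty-four kilobytes that are directly addressable, and changing which area of store a segment overlays extends the reach of the program. The sketch below is a purely illustrative model of that arithmetic. The names segment_map and translate are hypothetical, and the division of a 16-bit address into a segment number and an offset is an assumption, not a description of the GEC 4080's actual addressing hardware.

```c
#include <stdint.h>

enum {
    SEGMENT_SIZE  = 16 * 1024,   /* sixteen kilobytes per segment */
    SEGMENT_COUNT = 4            /* four segments are directly addressable */
};

/* Hypothetical table recording which area of store each segment overlays. */
struct segment_map {
    uint32_t base[SEGMENT_COUNT];   /* start address in store of each segment */
};

/*
 * Translate a 16-bit program address into an address in store,
 * assuming the address selects a segment and an offset within it.
 */
uint32_t translate(const struct segment_map *map, uint16_t addr)
{
    unsigned segment = addr / SEGMENT_SIZE;   /* 0 to 3 */
    unsigned offset  = addr % SEGMENT_SIZE;   /* 0 to 16383 */

    return map->base[segment] + offset;
}
```

In this model, pointing a segment at a different base address corresponds to overlaying a different area of store, so the same sixty-four kilobytes of program addresses can reach any part of a four Megabyte store.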

The Pascal system is based on the P4 Pascal compiler. 3 This compiler produces 
Pcode 4 which is then interpreted. Both the compiler and resultant compiled Pascal 
programs originally existed as Pcode instructions. The compiler has since been 
hand-coded into GEC 4080 machine code. This implementation is used as the principal 
system for the teaching of programming in the Computer Science department and is 
heavily used locally. In addition to providing a Pascal system locally, this implementation
was written in order that the described diagnostics package could be evaluated.


OPERATION OF PRTDS 

The diagnostics program PRTDS is written entirely in Pascal and, when invoked, is 
interpreted alongside the user program. The Pcode interpreter then operates in one of 
two states — user state or PRTDS state. 

In user state, the Pcode machine’s code and data areas are set up to refer to the
respective areas in store corresponding to the user program. In this way the system runs
largely as it would were PRTDS not in existence. In PRTDS state, the Pcode machine’s
code and data areas refer to the PRTDS program. For the most part no distinction is 